Week 2: Potential Outcomes and Experiments

PS 813 - Causal Inference

Anton Strezhnev

University of Wisconsin-Madison

January 26, 2026


This week

  • Defining causal estimands
    • The “potential outcomes” model of causation
  • Causal identification
    • Linking causal estimands to observable quantities
  • Randomized experiments as a solution to the identification problem
    • Treatment assignment is independent of the potential outcomes
  • Statistical inference for completely randomized experiments
    • Neyman’s approach
    • Fisher’s approach

The potential outcomes model

Thinking about causal effects

  • Two types of causal questions (Gelman and Imbens, 2013)

  • Causes of effects

    • What are the factors that generate some outcome \(Y\)?
    • “Why?” questions: Why do states go to war? Why do politicians get re-elected?
  • Effects of causes

    • If \(X\) were to change, what might happen to \(Y\)?
    • “What if?” questions: If a politician were an incumbent, would they be more likely to be re-elected compared to if they were a non-incumbent?
  • Our focus in this class is on effects of causes

    • Why? We can connect them to well-defined statistical quantities of interest (e.g. an “average treatment effect”)
    • “Causes of effects” are still important questions, but they’re more questions of theory

Defining a causal effect

  • Historically, causality was seen as a deterministic process.
    • Hume (1740): Causation as the “constant conjunction” of events
    • Mill (1843): Method of difference
  • This became problematic – empirical observation alone does not demonstrate causality.
    • Russell (1913): Scientists aren’t interested in causality!
  • How do we talk about causation that both incorporates uncertainty in measurement and clearly defines what we mean by a “causal effect”?

The potential outcomes model

  • Rubin (1974) - formalizes a framework for understanding causation from a statistical perspective.

    • Inspired by earlier work by Neyman (1923) and Fisher (1935) on randomized experiments.
  • We’ll spend most of our time with this approach, often called the Rubin Causal Model or potential outcomes framework.

  • Core idea:

    • Causal effects are effects of interventions
    • Causal effects are contrasts in counterfactuals
  • The potential outcomes framework clarifies:

    1. What action is doing the causing?
    2. Compared to what alternative action?
    3. On what outcome metric?
    4. How would we learn about the effect from data?

Statistical setup

  • Population of units
    • Finite population or infinite super-population
  • Sample of \(N\) units from the population indexed by \(i\)
  • Observed outcome \(Y_i\)
  • Binary treatment indicator \(D_i\).
    • Units receiving “treatment”: \(D_i = 1\)
    • Units receiving “control”: \(D_i = 0\)
  • Covariates (observed prior to treatment) \(X_i\)

Potential outcomes

  • Let \(D_i\) be the value of a treatment assigned to each individual.
  • \(Y_i(d)\) is the value that the outcome would take if \(D_i\) were set to \(d\).
    • For binary \(D_i\): \(Y_i(1)\) is the value we would observe if unit \(i\) were treated.
    • \(Y_i(0)\) is the value we would observe if unit \(i\) were under control
  • We model the potential outcomes as fixed attributes of the units.
  • Notation alert! – Sometimes you’ll see potential outcomes written as:
    • \(Y_i^1\), \(Y_i^0\) or \(Y_i^{d=1}\), \(Y_i^{d=0}\)
    • \(Y_{i0}\), \(Y_{i1}\)
    • \(Y_1(i)\), \(Y_0(i)\)
  • Causal effects are contrasts in potential outcomes.
    • Individual treatment effect: \(\tau_i = Y_i(1) - Y_i(0)\)
    • Can consider ratios or other transformations (e.g. \(\frac{Y_i(1)}{Y_i(0)}\))

Consistency/SUTVA

  • How do we link the potential outcomes to observed ones?

  • Consistency/Stable Unit Treatment Value (SUTVA) assumption

    \[Y_i(d) = Y_i \text{ if } D_i = d\]

  • Sometimes you’ll see this w/ binary \(D_i\) (often in econometrics)

    \[Y_i = Y_i(1)D_i + Y_i(0)(1-D_i)\]

  • Implications

    1. No interference - other units’ treatments don’t affect \(i\)’s potential outcomes.
    2. Single version of treatment
    3. \(D\) is in principle manipulable - a “well-defined intervention”
    4. The means by which treatment is assigned is irrelevant (a version of 2)
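The consistency link can be made concrete with a minimal R sketch (all numbers here are hypothetical): it builds the observed outcomes from a fully visible “table of science” via the switching equation above.

```r
# Hypothetical "table of science": both potential outcomes visible
Y1 <- c(5, 2, 9, 1)
Y0 <- c(3, 5, 9, 0)
tau_i <- Y1 - Y0               # individual treatment effects: 2 -3 0 1

# Consistency: the observed outcome is the potential outcome under the realized D_i
D <- c(1, 0, 1, 0)
Y <- Y1 * D + Y0 * (1 - D)     # observed outcomes: 5 5 9 0
```

In real data we would only ever see `D` and `Y`, never `Y1` and `Y0` jointly.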

Positivity/Overlap

  • We also need some assumptions on the treatment assignment mechanism \(D_i\).

  • In order to observe some units’ values of \(Y_i(1)\) and others’ values of \(Y_i(0)\), treatment assignment can’t be deterministic. For all \(i\):

    \[ 0 < Pr(D_i = 1) < 1 \]

  • If no units could ever receive treatment or control, it would be impossible to learn about \(\mathbb{E}[Y_i | D_i = 1]\) or \(\mathbb{E}[Y_i | D_i = 0]\)

  • This is sometimes called a positivity or overlap assumption.

    • Pretty trivial in a randomized experiment, but can be tricky in observational studies when \(D_i\) is perfectly determined by some covariates \(X_i\)

A missing data problem

  • It’s useful to think of the causal inference problem in terms of missingness in the complete table of potential outcomes.

| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|:---:|:---:|:---:|:---:|:---:|
| \(1\) | \(1\) | \(5\) | ? | \(5\) |
| \(2\) | \(0\) | ? | \(-3\) | \(-3\) |
| \(3\) | \(1\) | \(9\) | ? | \(9\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(N\) | \(0\) | ? | \(8\) | \(8\) |

  • If we could observe both \(Y_i(1)\) and \(Y_i(0)\) for each unit, then this would be easy!
  • But we can’t - we only observe what we’re given by \(D_i\)
  • Holland (1986) calls this “The Fundamental Problem of Causal Inference”

Causal Estimands

  • The individual causal effect: \(\tau_i\) (can’t identify this w/o strong assumptions!)

    \[\tau_i = Y_i(1) - Y_i(0)\]

  • The sample average treatment effect (SATE): \(\tau_s\)

    \[\tau_s = \frac{1}{N}\sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big)\]

  • The population average treatment effect (PATE) \(\tau_p\)

    \[\tau_p = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\]

Sample vs. Population Estimands

  • With the SATE and PATE, we’ve made an important distinction between two sources of uncertainty
    • Random assignment of treatment (unobserved P.O.s)
    • Sampling from a population.
  • Even if we’re just interested in the treatment effect within our sample, there’s still uncertainty
  • When can we go from SATE to PATE?
    • If we have a random sample from the target population
    • If there are no sources of effect heterogeneity that differ between sample and target population
    • We’ll spend Week 3 talking about this problem - external validity

Causal vs. Associational Estimands

Causal Identification

  • Causal identification: Can we learn about the value of a causal effect from the observed data?
    • Can we express the causal estimand (e.g. \(\tau_p = \mathbb{E}[Y_i(1) - Y_i(0)]\)) entirely in terms of observable quantities?
  • Causal identification comes prior to questions of estimation
    • It doesn’t matter whether you’re using regression, weighting, matching, doubly-robust estimation, double-LASSO, etc…
    • If you can’t answer the question “What’s your identification strategy?” then no amount of fancy stats will solve your problems.
  • Identification requires assumptions about the connection between the observed data \(Y_i\), \(D_i\) and the unobserved counterfactuals \(Y_i(d)\)
    • (e.g.) Under what assumptions will the observed difference-in-means identify the average treatment effect?

Identifying the ATT

  • Suppose we want to identify the (population) Average Treatment Effect on the Treated (ATT)

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) | D_i = 1]\]

  • Let’s see what our consistency/SUTVA assumption gets us!

  • First, let’s use linearity:

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\]

  • Next, consistency

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\]

Identifying the ATT

  • Still not enough though. We have an unobserved term \(\mathbb{E}[Y_i(0) | D_i = 1]\). Why can’t we observe this directly?

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\]

  • Let’s see what the difference would be between the ATT and the simple difference-in-means \(\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\). Add and subtract \(\mathbb{E}[Y_i | D_i = 0]\)

    \[\tau_{\text{ATT}} = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] + \mathbb{E}[Y_i | D_i = 0]\]

  • Rearranging terms

    \[\tau_{\text{ATT}} = \bigg(\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\bigg) - \bigg(\mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\bigg)\]

Identifying the ATT

  • Now we have an expression for the ATT in terms of the difference-in-means and a bias term

    \[\tau_{\text{ATT}} = \underbrace{\bigg(\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\bigg)}_{\text{Difference-in-means}} - \underbrace{\bigg(\mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0]\bigg)}_{\text{Selection-into-treatment bias}}\]

  • What does this bias term represent? How can we interpret it?

    • How much higher (or lower) the potential outcomes under control are for units that receive treatment vs. those that receive control.
    • Sometimes called a selection-into-treatment problem - units that choose treatment may have higher or lower potential outcomes than those that choose control.
  • Can do the same analysis for the average treatment effect on the controls (ATC) and, by extension, the average treatment effect (ATE)

Selection-into-treatment bias

  • Can use theory to “sign the bias”:
    • Suppose \(Y_i\) was an indicator of whether someone voted in an election and \(D_i\) was an indicator for whether they received a political mailer.
    • Consider a world where the mailer was sent out non-randomly to everyone who had signed up for a politician’s mailing list.
    • If we took the difference in turnout rates between voters who received the mailer and voters who did not receive the mailer, would we be over-estimating or under-estimating the effect of treatment?
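We can make the selection story concrete with a small simulation. Everything here is a hypothetical data-generating process (the signup rates, turnout rates, and effect size are invented), but it shows the naive difference-in-means overstating the ATT when latent political interest drives both treatment and turnout:

```r
set.seed(813)
N <- 10000

# Hypothetical DGP: latent political interest drives both list signup and turnout
interest <- runif(N)
D  <- rbinom(N, 1, 0.1 + 0.8 * interest)   # mailer goes mostly to the interested
Y0 <- rbinom(N, 1, 0.2 + 0.6 * interest)   # turnout without the mailer
Y1 <- pmin(Y0 + rbinom(N, 1, 0.05), 1)     # true effect: small and positive
Y  <- D * Y1 + (1 - D) * Y0                # observed turnout

naive <- mean(Y[D == 1]) - mean(Y[D == 0])   # difference-in-means
att   <- mean(Y1[D == 1] - Y0[D == 1])       # true ATT (knowable only in simulation)
naive - att   # positive: E[Y(0) | D = 1] > E[Y(0) | D = 0], so we over-estimate
```

Here theory signs the bias: people who sign up for the list would have voted at higher rates anyway, so the naive comparison over-estimates the effect.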

Ignorability/Unconfoundedness

  • What if we want point identification and not just bounds on the causal effect?

    • What assumption can we make such that the difference-in-means identifies the ATT (or ATE)?
  • We assume: the selection-into-treatment bias is \(0\)

    \[\mathbb{E}[Y_i(0) | D_i = 1] = \mathbb{E}[Y_i(0) | D_i = 0]\]

    \[\mathbb{E}[Y_i(1) | D_i = 1] = \mathbb{E}[Y_i(1) | D_i = 0]\]

  • This will be true if treatment is independent of the potential outcomes.

    \[\{Y_i(1), Y_i(0)\} {\perp \! \! \! \perp} D_i\]

  • Common names for this assumption: exogeneity, unconfoundedness, ignorability

    • In simple terms: Treatment is not systematically more/less likely to be assigned to units that have higher/lower potential outcomes.

Ignorability/Unconfoundedness

  • The difference-in-means identifies the average treatment effect under three assumptions:

    1. Consistency/SUTVA
    2. Positivity/Overlap
    3. Ignorability/Unconfoundedness
  • Consistency gives us:

    \[\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] = \mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0]\]

  • And ignorability gives us:

    \[\mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = \tau\]

Experiments

Randomized Experiments

  • What sort of research design justifies ignorability?

    • A randomized experiment!
  • An experiment is any study where a researcher knows and controls the treatment assignment probability \(Pr(D_i = 1)\)

  • A randomized experiment is an experiment that satisfies:

    • Positivity: \(0 < Pr(D_i = 1) < 1\) for all units
    • Ignorability: \(Pr(D_i = 1| \mathbf{Y}(1), \mathbf{Y}(0)) = Pr(D_i = 1)\)
      • Another implication of \(\mathbf{Y}(1), \mathbf{Y}(0) {\perp \! \! \! \perp} D_i\)
      • Treatment assignment probabilities do not depend on the potential outcomes.

Types of experiments

  • Lots of ways in which we could design a randomized experiment where ignorability holds:
  • Let \(N_t\) be the number of treated units and \(N_c\) the number of control units
  • Bernoulli randomization:
    • Independent coin flips for each \(D_i\). \(Pr(D_i = 1) = p\)
    • \(D_i {\perp \! \! \! \perp} D_j\) for all \(i\), \(j\).
    • \(N_t\), \(N_c\) are random variables
  • Complete randomization
    • Fix \(N_t\) and \(N_c\) in advance. Randomly select \(N_t\) units to be treated.
    • Each unit has an equal probability of being treated: \(Pr(D_i = 1) = N_t/N\)
    • Each assignment with \(N_t\) treated units is equally likely to occur
    • \(D_i\) is independent of potential outcomes, but treatment assignment is slightly dependent across units.
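The contrast between the two schemes can be sketched in base R (the sample size and treatment probability are arbitrary choices for illustration):

```r
set.seed(813)
N <- 10

# Bernoulli randomization: independent coin flips; N_t is a random variable
D_bern <- rbinom(N, 1, 0.5)

# Complete randomization: exactly N_t = 5 treated units in every realization
D_comp <- sample(rep(c(0, 1), each = N / 2))
sum(D_comp)   # always 5
```

Under Bernoulli randomization `sum(D_bern)` varies across draws (and could even be 0 or `N`); under complete randomization the group sizes are fixed by design.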

Types of experiments

  • Stratified randomization
    • Using covariates \(X_i\), form \(J\) total blocks or strata of units with similar or identical covariate values.
    • Completely randomize within each of the \(J\) blocks
    • In the limit, pair-randomization w/ strata of size \(2\).
  • Cluster randomization
    • Each unit \(i\) belongs to one of \(C\) clusters: \(C_i \in \{1, 2, \dotsc, C\}\), \(C < N\).
    • Treatment is assigned by complete randomization at the cluster level
      • Randomly select some number of clusters to be treated, remainder get control.
    • If units share cluster membership, they get the same treatment (\(C_i = C_j \implies D_i = D_j\))
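Both designs can be sketched in base R (the block labels, cluster count, and sizes are hypothetical):

```r
set.seed(813)

# Stratified randomization: complete randomization within covariate-defined blocks
blocks  <- rep(c("urban", "rural"), each = 6)
D_strat <- unsplit(lapply(split(blocks, blocks),
                          function(b) sample(rep(c(0, 1), each = length(b) / 2))),
                   blocks)
table(blocks, D_strat)   # 3 treated, 3 control within each block

# Cluster randomization: randomize clusters; members share their cluster's treatment
cluster <- rep(1:4, each = 3)              # 4 clusters of 3 units
treated_clusters <- sample(1:4, 2)         # complete randomization over clusters
D_clust <- as.integer(cluster %in% treated_clusters)
```

Note how stratification fixes the treated share within each block, while cluster assignment makes \(D_i\) perfectly correlated within clusters.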

Complete randomization

  • How do we do estimation and inference under complete randomization?

    • We’ll start with the finite-sample setting and illustrate the Neyman (1923) approach to inference for the SATE.
  • Define our quantity of interest, the sample average treatment effect

    \[\tau_{\text{s}} = \frac{1}{N}\sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big)\]

  • Our estimator is the sample difference-in-means.

    \[\hat{\tau} = \frac{1}{N_t} \sum_{i=1}^N Y_i D_i - \frac{1}{N_c} \sum_{i=1}^N Y_i (1 - D_i)\]

Finite sample inference

  • Consider a study with \(N_t = 3\), \(N_c = 3\) and suppose we could see the true “table of science”
    • Under one realization of the treatment \(\mathbf{D}\), we have:

| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|:---:|:---:|:---:|:---:|:---:|
| \(1\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(2\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(4\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(5\) | \(0\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | \(1\) | \(1\) |

  • For this assignment, our realization of \(\hat{\tau}\) (our estimate) would be:

    \[\frac{1 + 1 + 1}{3} - \frac{1 + 1 + 0}{3} = \frac{1}{3}\]

Finite sample inference

  • How about another, equally likely realization?

| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|:---:|:---:|:---:|:---:|:---:|
| \(1\) | \(0\) | \(1\) | \(0\) | \(0\) |
| \(2\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(4\) | \(1\) | \(0\) | \(1\) | \(0\) |
| \(5\) | \(1\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(0\) | \(1\) | \(1\) | \(1\) |

  • For this randomization, our realization of \(\hat{\tau}\) would be:

    \[\frac{1 + 0 + 0}{3} - \frac{0 + 1 + 1}{3} = -\frac{1}{3}\]

Finite sample inference

  • Over all possible randomizations, what is the distribution?
    • We can run a quick simulation and find out
### Define the data frame
data <- data.frame(Y_1 = c(1, 0, 1, 0, 0, 1),
                   Y_0 = c(0, 1, 0, 1, 0, 1))

## Simulate the sampling distribution
nIter <- 10000
sate_est <- rep(NA, nIter)
set.seed(53703)
for(i in 1:nIter){
  data$D <- sample(rep(c(0, 1), each = 3))             # complete randomization
  data$Y <- data$D*data$Y_1 + (1 - data$D)*data$Y_0    # consistency
  sate_est[i] <- mean(data$Y[data$D == 1]) - mean(data$Y[data$D == 0])
}

Finite sample inference

  • First, what’s the expectation of our estimator?
mean(sate_est)
[1] -0.00387
  • Next, what’s the variance?
var(sate_est)
[1] 0.0661

Finite sample inference

[Figure: histogram of the simulated sampling distribution of \(\hat{\tau}\)]
Finite sample inference

  • Of course, in real data, we only get one estimate.
    • Need to rely on theory to understand the distribution that estimate came from in order to do inference.
  • Is \(\hat{\tau}|\mathbf{Y}(1), \mathbf{Y}(0)\) unbiased for the SATE?
    • Under complete randomization: Yes!
  • What is the sampling variance \(Var(\hat{\tau})\) under a finite sample (fixed \(\mathbf{Y}(1)\), \(\mathbf{Y}(0)\))?
    • Surprisingly, it depends on the amount of effect heterogeneity
  • What should our estimator of the sampling variance \(\widehat{Var(\hat{\tau})}\) be?
    • Our conventional “difference-in-means” variance estimator is conservative
    • Sadly, we can’t leverage the effect heterogeneity part (w/o more assumptions)!
    • Fundamental problem of causal inference strikes again!

Unbiasedness

  • Let’s show \(\hat{\tau}\) is unbiased for the SATE. First, by linearity of expectations:

    \[\mathbb{E}[\hat{\tau} | \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N \mathbb{E}\bigg[Y_i D_i \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg] - \frac{1}{N_c} \sum_{i=1}^N \mathbb{E}\bigg[Y_i (1 - D_i) \bigg| \mathbf{Y}(1), \mathbf{Y}(0) \bigg]\]

  • By consistency \(Y_iD_i = Y_i(1)D_i\) and \(Y_i(1-D_i) = Y_i(0)(1-D_i)\)

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N \mathbb{E}\bigg[Y_i(1) D_i \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg] - \frac{1}{N_c} \sum_{i=1}^N \mathbb{E}\bigg[Y_i(0) (1 - D_i) \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg]\]

  • Conditional on the potential outcomes, \(Y_i(1)\) and \(Y_i(0)\) are constants

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N Y_i(1) \mathbb{E}\bigg[ D_i\bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg] - \frac{1}{N_c} \sum_{i=1}^N Y_i(0) \mathbb{E}\bigg[(1 - D_i) \bigg| \mathbf{Y}(1), \mathbf{Y}(0)\bigg]\]

Unbiasedness

  • \(D_i\) has a known distribution under complete randomization and its expectation is \(Pr(D_i = 1)\), which is just \(N_t/N\)

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N_t} \sum_{i=1}^N Y_i(1) \frac{N_t}{N} - \frac{1}{N_c} \sum_{i=1}^N Y_i(0) \frac{N_c}{N}\]

  • Pulling out the constants

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N} \sum_{i=1}^N Y_i(1) - \frac{1}{N} \sum_{i=1}^N Y_i(0)\]

  • And we have the SATE!

    \[\mathbb{E}[\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)] = \frac{1}{N} \sum_{i=1}^N \big(Y_i(1) - Y_i(0)\big) = \tau_s\]

Sampling variance

  • What’s the variance of \(\hat{\tau}\) going to be (conditional on the sample)? Slightly tricky since \(D_i\) is not independent of \(D_j\).

\[Var\bigg(\hat{\tau}| \mathbf{Y}(1), \mathbf{Y}(0)\bigg) = \frac{S^2_t}{N_t} + \frac{S^2_c}{N_c} - \frac{S^2_{\tau_i}}{N}\]

  • The outcome variances are:

    \[S_t^2 = \frac{1}{N-1} \sum_{i=1}^N \bigg(Y_i(1) - \overline{Y(1)}\bigg)^2\] \[S_c^2 = \frac{1}{N-1} \sum_{i=1}^N \bigg(Y_i(0) - \overline{Y(0)}\bigg)^2\]

  • And the third term is the sample variance of the treatment effects

    \[S_{\tau_i}^2 = \frac{1}{N-1} \sum_{i=1}^N \bigg(\big(Y_i(1) - Y_i(0)\big) - \big(\overline{Y(1)} - \overline{Y(0)}\big) \bigg)^2\]

Sampling variance

  • Can we estimate the sampling variance?

    • Well \(S^2_t\) and \(S^2_c\) can be estimated from their sample analogues (just the sample variances within treated/control groups)

    \[s_t^2 = \frac{1}{N_t-1} \sum_{i:D_i = 1} \bigg(Y_i(1) - \bar{Y}_t^{\text{obs}}\bigg)^2\]

    \[s_c^2 = \frac{1}{N_c-1} \sum_{i:D_i = 0} \bigg(Y_i(0) - \bar{Y}_c^{\text{obs}}\bigg)^2\]

  • But…we can’t estimate \(S^2_{\tau_i}\) directly from the sample!

    • The fundamental problem of causal inference! Can’t observe individual treatment effects.

Neyman variance

  • Neyman suggested just ignoring that third term and using our familiar estimator

\[\widehat{\mathbb{V}}_{\text{Neyman}} = \frac{s_t^2}{N_t} + \frac{s_c^2}{N_c}\]

  • What are its properties?
    • We know it’s conservative since \(S_{\tau_i}^2 \ge 0\).
    • Confidence intervals using the Neyman standard error \(\sqrt{\widehat{\mathbb{V}}_{\text{Neyman}}}\) will be no smaller than they should be.
    • If treatment effects are constant, it’s unbiased!

Neyman variance

  • Why do we see a difference between the true variance and the Neyman variance?
    • Let’s go back to our \(N=6\) example!

| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|:---:|:---:|:---:|:---:|:---:|
| \(1\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(2\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(0\) | \(1\) |
| \(4\) | \(0\) | \(0\) | \(1\) | \(1\) |
| \(5\) | \(0\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | \(1\) | \(1\) |


Neyman variance

  • The variance from our simulation was…
var(sate_est)
[1] 0.0661
  • We can verify our exact variance calculation from before…
var(data$Y_1)/3 + var(data$Y_0)/3 - var(data$Y_1 - data$Y_0)/6
[1] 0.0667
  • But if we ignore the variance of the treatment effects…
var(data$Y_1)/3 + var(data$Y_0)/3
[1] 0.2

Neyman variance

  • When is the Neyman variance estimator unbiased/consistent for the true variance of \(\hat{\tau}\)?
    • When effects are constant
    • …or when we’re targeting the population ATE (under a “random sampling” assumption)
  • Intuition: With random sampling from a target population, we can think of the treated and control groups as two separate samples of size \(N_t\) and \(N_c\) drawn from the population distributions of \(Y_i(1)\) and \(Y_i(0)\) respectively.

Illustration: Gerber, Green and Larimer (2008)

  • Gerber, Green and Larimer (2008) want to know what causes people to vote.
    • What sorts of encouragements will get people to turn out more or less?
  • Five treatment conditions in a randomized GOTV mailer experiment:
    • No mailer (0)
    • “Researchers will be studying your turnout” mailer (Hawthorne) (1)
    • “Voting is a civic duty” mailer (Civic Duty) (2)
    • “Your and your neighbors’ voting history” mailer (Neighbors) (3)
    • “Your turnout history” mailer (Self) (4)
  • Gerber, Green and Larimer first analyze households. Why?
    • Is \(Y_i(d)\) well-defined for an individual? Somewhat tricky - likely spillovers across household members.
    • Treatment is randomized by household.

Illustration: Gerber, Green and Larimer (2008)

# Load the data
library(haven)   # read_dta()
library(dplyr)
library(knitr)   # kable()
data <- read_dta('assets/data/ggr_2008_individual.dta')

# Aggregate to the household level
data_hh <- data %>% group_by(hh_id) %>% summarize(treatment = treatment[1], voted = mean(voted))

# For each treatment condition, calculate N and share voting
hh_means <- data_hh %>% group_by(treatment) %>% summarize(N = n(), voted = mean(voted))
kable(hh_means)
| treatment | N | voted |
|---:|---:|---:|
| 0 | 99999 | 0.304 |
| 1 | 20002 | 0.332 |
| 2 | 20001 | 0.325 |
| 3 | 20000 | 0.389 |
| 4 | 20000 | 0.357 |

Illustration: Gerber, Green and Larimer (2008)

  • Let’s estimate the ATE of the “Neighbors” treatment relative to control
# Estimated ATE of Neighbors (3) vs. Control (0)
ate <- mean(data_hh$voted[data_hh$treatment == 3]) -
  mean(data_hh$voted[data_hh$treatment == 0])
ate
[1] 0.0848
  • And let’s compute the Neyman variance
# Estimate the sampling variance
var_ate = var(data_hh$voted[data_hh$treatment == 3])/sum(data_hh$treatment == 3) +
  var(data_hh$voted[data_hh$treatment == 0])/sum(data_hh$treatment == 0)

# Square root to get estimated SE
sqrt(var_ate)
[1] 0.0034

Illustration: Gerber, Green and Larimer (2008)

  • 95% asymptotic confidence interval and p-value against null of no ATE.
# Confidence interval (assuming asymptotic normality)
ate_95CI = c(ate - qnorm(.975)*sqrt(var_ate),
  ate + qnorm(.975)*sqrt(var_ate))
ate_95CI
[1] 0.0781 0.0915
# P-value H_0: \tau = 0, H_a: \tau \neq 0
p_val = 2*pnorm(-abs(ate/sqrt(var_ate)))
p_val
[1] 3.64e-137

Illustration: Gerber, Green and Larimer (2008)

  • Fun fact: You can get this via OLS regression!
    1. OLS with a single binary regressor is just the difference-in-means.
    2. The classic OLS standard errors impose too many additional assumptions (homoskedasticity)
    3. The usual “robust” SEs are close to the Neyman variance in large samples…
    4. …but Samii and Aronow (2012) show that the Neyman variance is exactly equal to robust SEs w/ HC2 correction
  • That’s the default for lm_robust in the estimatr package!
lm_robust(voted ~ I(treatment==3), data=data_hh %>% filter(treatment == 3|treatment == 0))
                      Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper
(Intercept)             0.3043    0.00132   230.2  0.00e+00   0.3017   0.3069
I(treatment == 3)TRUE   0.0848    0.00340    24.9 8.13e-137   0.0781   0.0915
                          DF
(Intercept)           119997
I(treatment == 3)TRUE 119997

Fisher’s Exact Test

Fisher’s Exact Test

  • Neyman’s framework:
    • Estimand: the average treatment effect \(\tau = \mathbb{E}[Y_i(1) - Y_i(0)]\)
    • Estimator: the difference-in-means, with known expectation and sampling variance
    • Hypothesis test: null of no average effect \(H_0: \tau = 0\)
    • Large-sample/asymptotic theory gives the distribution of \(\hat{\tau}\) for calculating p-values
  • Fisher’s framework:
    • Can we get a p-value under (some) null hypothesis without the large-sample assumptions?
    • What does randomization alone justify?
    • Exact p-values under a sharp null of no individual treatment effect \(Y_i(1) = Y_i(0)\) for all \(i\).
    • “Randomization test”: a flexible framework for inference under any known randomization scheme

Hypothesis testing review

  • Four steps to conducting a hypothesis test
  1. Define the null hypothesis.
  • Neyman framework: \(H_0: \tau = \tau_0\)
  • The probability of observing a particular value of the test statistic depends on what is “true” about the underlying parameter.
  • Our thought experiment: If the null were true, how likely would we be to see what we observe (or something more extreme)?
  2. Choose a test statistic
  • In classical hypothesis testing, we pick something that has useful statistical properties:

    \[T = (\hat{\tau} - \tau_0)/\sqrt{\widehat{Var({\hat{\tau}})}}\]

Hypothesis testing review

  3. Determine the distribution of the test statistic under the null
  • In classical testing, by the CLT, in large samples \(T \sim \mathcal{N}(0, 1)\)
  • In smaller samples, you may have made further assumptions (e.g. the outcome is normally distributed) to show that \(T\) follows a \(t\)-distribution
  • We need a distribution to get probabilities!
  4. What is the probability of observing the test statistic \(T\) that you observe in-sample (or a more extreme value), given the known distribution under the null?
    • That’s a p-value!

What’s different about randomization testing?

  • Randomization tests are a hypothesis test but with 2 main differences from our usual approach
  1. Different null hypothesis (a “sharp” null)
  2. No assumptions/asymptotics to derive the distribution of \(T\)
  • We can instead literally calculate the test statistic under each possible realization of treatment.
  • And we know the distribution of treatment assignments because we control them in an experiment.

Sharp null of no effect

  • The sharp null hypothesis states: \[H_0: \tau_i = Y_i(1) - Y_i(0) = 0 \text{ } \forall i\]

  • Sharp null implies zero ATE

    • But zero ATE does not imply sharp null!
  • Why do we make this assumption?

    • Because now the observed data tells us everything we need to know about the potential outcomes

The sharp null

  • Remember our table of science? For a single realization of \(\mathbf{D}\), we only observe half the potential outcomes.

| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|:---:|:---:|:---:|:---:|:---:|
| \(1\) | \(1\) | \(1\) | ? | \(1\) |
| \(2\) | \(0\) | ? | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | ? | \(1\) |
| \(4\) | \(0\) | ? | \(1\) | \(1\) |
| \(5\) | \(0\) | ? | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | ? | \(1\) |

  • But what does the sharp null imply about the unobserved potential outcomes? Can we fill in those question marks?

The sharp null

  • Yes! Under the sharp null, \(Y_i(1) = Y_i(0)\)

| Unit \(i\) | Treatment \(D_i\) | \(Y_i(1)\) | \(Y_i(0)\) | Observed \(Y_i\) |
|:---:|:---:|:---:|:---:|:---:|
| \(1\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(2\) | \(0\) | \(1\) | \(1\) | \(1\) |
| \(3\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(4\) | \(0\) | \(1\) | \(1\) | \(1\) |
| \(5\) | \(0\) | \(0\) | \(0\) | \(0\) |
| \(6\) | \(1\) | \(1\) | \(1\) | \(1\) |

  • Why is this useful?
    • We can calculate the value of our test statistic not only under the observed \(\mathbf{D}\) but under all other possible realizations of \(\mathbf{D}\)!

The test statistic

  • Our test statistic is any function of treatment assignments \(\mathbf{D}\) and observed outcomes \(\mathbf{Y}\).

  • Lots of choices with different degrees of power for different kinds of treatment effects.

  • We want to pick a test statistic that will return large values when the null is false and small values when it is true.

  • A reasonable default: (absolute) difference-in-means

    \[t(\mathbf{D}, \mathbf{Y}) = \bigg|\frac{1}{N_t} \sum_{i=1}^N Y_i D_i - \frac{1}{N_c} \sum_{i=1}^N Y_i (1 - D_i) \bigg|\]

  • What sorts of alternatives might this be bad for?

    • Offsetting positive and negative effects will return small values of \(t(\mathbf{D}, \mathbf{Y})\) as would no effects.
    • We might pick a different test statistic in this case!

The randomization distribution

  • Under the sharp null, we can calculate \(t(\mathbf{D}, \mathbf{Y})\) for every possible realization of \(\mathbf{D}\).

    • Why? Because under the sharp null, observed \(Y_i\) is unaffected by treatment assignment.
  • Then to get a distribution for \(t(\mathbf{D}, \mathbf{Y})\), we just need to know the distribution of \(\mathbf{D}\). We know this by designing the experiment!

  • We get our p-value by comparing the observed test statistic for our particular sample \(t^*\) to the distribution of \(t(\mathbf{D}, \mathbf{Y})\)

    • For complete randomization, each of the \(K\) possible realizations of \(\mathbf{D}\) is equally likely, so we just enumerate all possible assignments \(\mathbf{d} \in \Omega\) and calculate the share that are greater than our observed test statistic.

    \[Pr\big(t(\mathbf{D}, \mathbf{Y}) \ge t^*\big) = \frac{\sum_{\mathbf{d} \in \Omega} \mathbb{I}\big(t(\mathbf{d}, \mathbf{Y}) \ge t^*\big)}{K}\]

  • This is our p-value, which we compare to some threshold level \(\alpha\) and reject the null when it’s below that level.
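For small designs we can do the enumeration directly. Here is a sketch for a hypothetical \(N = 6\), \(N_t = 3\) experiment (the outcome values are invented for illustration):

```r
# Hypothetical experiment: N = 6, complete randomization with N_t = 3
Y     <- c(5, -3, 9, 1, -2, 8)   # observed outcomes (fixed under the sharp null)
D_obs <- c(1, 0, 1, 0, 0, 1)

t_stat <- function(D, Y) abs(mean(Y[D == 1]) - mean(Y[D == 0]))
t_obs  <- t_stat(D_obs, Y)

# Enumerate all choose(6, 3) = 20 equally likely assignments
treated_sets <- combn(6, 3)
t_all <- apply(treated_sets, 2, function(tr) {
  t_stat(as.integer(seq_along(Y) %in% tr), Y)
})
p_exact <- mean(t_all >= t_obs)   # share of assignments at least as extreme
p_exact                           # 0.1: 2 of the 20 assignments
```

Only the observed assignment and its mirror image separate the groups this sharply, so the exact p-value is \(2/20 = 0.1\).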

Monte Carlo approximation

  • We could enumerate every possible treatment vector and actually just calculate \(t(\mathbf{D}, \mathbf{Y})\).
  • Even in fairly small samples this can involve quite a lot of computations! (e.g. \({20 \choose 10 } = 184756\))
  • We’ll typically use a Monte Carlo approximation to the exact p-value.
    • This is also easier for more complicated randomization schemes.
  • Procedure:
    • For \(K\) iterations:
      1. Draw a realization of the treatment vector \(\mathbf{d}_k\) from the known distribution of \(\mathbf{D}\).
      2. Calculate the test statistic \(t_k = t(\mathbf{d}_k, \mathbf{Y})\)
    • Our p-value is the share of these \(K\) test statistics that are at least as large as the observed \(t^*\)

Putting it all together

  • To do randomization inference under the sharp null
    1. Choose a test statistic
    2. Calculate the observed test statistic in your sample \(t^* = t(\mathbf{D}, \mathbf{Y})\)
    3. Draw another treatment vector \(\mathbf{d}_1\) from the known distribution of \(\mathbf{D}\)
    4. Calculate \(t_1 = t(\mathbf{d}_1, \mathbf{Y})\)
    5. Repeat 3 and 4 as long as you want to get \(K\) samples from the distribution of the test statistic under the null
    6. Calculate \(p = \frac{1}{K}\sum_{k=1}^K \mathbb{I}(t_k \ge t^*)\)
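The six steps can be sketched with a Monte Carlo approximation (the data are hypothetical; `sample()` redraws complete-randomization assignments from the known design):

```r
set.seed(53703)
Y     <- c(5, -3, 9, 1, -2, 8)   # hypothetical observed outcomes
D_obs <- c(1, 0, 1, 0, 0, 1)     # observed complete-randomization assignment

# Steps 1-2: choose a test statistic and compute it on the observed assignment
t_stat <- function(D, Y) abs(mean(Y[D == 1]) - mean(Y[D == 0]))
t_obs  <- t_stat(D_obs, Y)

# Steps 3-5: redraw assignments from the known design and recompute
K <- 10000
t_null <- replicate(K, t_stat(sample(D_obs), Y))

# Step 6: the Monte Carlo p-value
p_mc <- mean(t_null >= t_obs)
```

With enough draws, `p_mc` approximates the exact enumeration p-value for this design.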

Inverting tests to get confidence intervals

  • Randomization tests alone give us p-values but no confidence intervals.
  • One approach: “invert” the test - for what values of a “treatment effect” would we fail to reject the null?
    • A \(100(1-\alpha)\%\) confidence interval contains the set of parameter values for which an \(\alpha\)-level hypothesis test would fail to reject the null.
  • Slight complication: We now need to actually define a “treatment effect” parameter:
    • For example, assume a constant additive effect for all units

      \[Y_i(1) - Y_i(0) = \tau_0\]

    • Our confidence set would be all of the values of \(\tau_0\) for which we’d fail to reject the null

    • Calculate via a grid search through possible values of \(\tau_0\).
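The grid search can be sketched as follows, under an assumed constant additive effect (the data-generating process here is hypothetical, with a true effect of 2):

```r
set.seed(53703)

# Hypothetical experiment: N = 40, constant additive effect of 2
N <- 40
D <- sample(rep(c(0, 1), each = N / 2))
Y <- rnorm(N) + 2 * D

t_stat <- function(D, Y) abs(mean(Y[D == 1]) - mean(Y[D == 0]))

# p-value for the sharp null Y_i(1) - Y_i(0) = tau0: subtracting tau0 from
# treated outcomes makes the null of no effect hold in the adjusted data
ri_pval <- function(tau0, K = 2000) {
  Y_adj <- Y - tau0 * D
  t_obs <- t_stat(D, Y_adj)
  mean(replicate(K, t_stat(sample(D), Y_adj)) >= t_obs)
}

# Grid search: keep every tau0 we fail to reject at alpha = 0.05
grid  <- seq(0, 4, by = 0.1)
pvals <- sapply(grid, ri_pval)
ci_95 <- range(grid[pvals > 0.05])
```

The resulting interval brackets the difference-in-means estimate, and its coverage is justified by the randomization alone (given the constant-effect model).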